1 Abstract

Many machine learning models have been deployed to predict whether a person will default on a loan or pay it back. These decisions are made with predictive modelling, using attributes of the person such as education level, sex, personal status, checking account status, savings account or bonds, and many more. The aim of this project is to use one such publicly available dataset, the Statlog German Credit Risk dataset, which contains anonymized data on the customers of a bank, with their personal details and whether they defaulted on their loan or were good customers who paid it back. In this project I first pre-process the data into both a human-readable and a machine-readable format, then explore the data and derive inferences, and finally use it to predict whether a person will default on their loan.

Keywords - German-Credit-Risk, Machine-Learning, Predictive Modelling

2 Introduction

Most banks’ main source of income is providing loans to their customers. They hold people’s deposits and pay some interest on that money, and they lend money to other customers at a higher interest rate. This margin between the savings interest and the loan interest is where banks make most of their money.

But every time a bank provides a loan, it faces the risk that the loan will not be paid back. Generally, banks take some type of collateral, such as a person’s property. However, most banks would rather avoid lending to a person who will default at all, since they lose both the money and its time value: in that time they could have lent to a person who would repay the loan.

Therefore, it is crucial to determine, before the bank provides the loan, whether a person is likely to default or to pay the loan back. In this project I pre-processed the data, then plotted graphs using the R programming language and the plotly package. I derived inferences from these plots and finally used the data to build a machine learning model that predicts whether a person will default.

I also wish to make a Shiny web application that takes all of the required data and predicts whether a person can be provided with a loan.

PROJECT: https://github.com/suryasashankgundepudi/german-credit-risk-modelling

SHINY WEB APP: YET TO BE DEVELOPED

2.1 About the Various Aspects of the project

  1. Packages Required
  • RCurl
  • dplyr
  • tidyr
  • descr
  • reshape2
  • ggplot2
  • plotly
  • CatEncoders
  • stringr
  • superml
  2. Inspiration and Acknowledgements
  • I would like to thank the Kaggle community for helping me review the code and for suggesting many ideas for data visualization techniques. I have taken inspiration from various places and have tried to implement this project in R.
  • I was motivated to take up this project because I had just learnt R programming and wanted to learn various data visualization techniques.
  • I would also like to thank the Stack Overflow community for helping me with some of the simplest yet most crucial problems.

3 Dataset Description

This dataset was provided by Dr. Hans Hofmann from the University of Hamburg (Universität Hamburg). It is publicly available for data scientists to use at the UCI Machine Learning Repository. The direct link to the dataset, with both the numeric and the original data, is STATLOG-GERMAN-CREDIT.

The dataset contains anonymized data on 1000 customers who have either defaulted on their bank loan or paid back their credit duly. It contains 20 attributes, 7 of which are numerical and 13 of which are categorical. These attributes hold relevant information about the customer and are listed below:

  • Status of existing checking account (categorical)
  • Duration in months (numerical)
  • Credit history (categorical)
  • Purpose (categorical)
  • Credit amount (numerical)
  • Savings account/bonds (categorical)
  • Present employment since (categorical)
  • Installment rate in percentage of disposable income (numerical)
  • Personal status and sex (categorical; one attribute in the source data, split into separate Personal.Status and Sex columns during cleaning)
  • Other debtors / guarantors (categorical)
  • Present residence since (numerical)
  • Property (categorical)
  • Age in years (numerical)
  • Other installment plans (categorical)
  • Housing (categorical)
  • Number of existing credits at this bank (numerical)
  • Number of people being liable to provide maintenance for (numerical)
  • Job (categorical)
  • Telephone (categorical)
  • Foreign worker (categorical)

The target variable is the outcome, that is, the risk taken by the bank. It is 1 if the risk taken was good (the person did not default) and 0 if the person defaulted.

4 Data Pre-Processing

In this section I cover some of the basic data pre-processing techniques I employed to obtain more understandable and descriptive data.

4.1 Reading the data

The data was first read from the UCI Machine Learning Repository with a short chunk of R code. The only package required for this chunk is RCurl.

I saved this data into a new directory for further processing.

The table below shows how the data looks without any kind of pre-processing.

The Raw Data.
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1
A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2
A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2 A191 A201 1
A11 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173 2 A191 A201 1
A11 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173 2 A191 A201 2

4.2 Mutating the data

As you can see, the raw data has no column names, and the values themselves are not descriptive: the categorical columns hold codes such as A11 or A34 instead of readable labels.

Since the data in this form cannot be used for exploratory data analysis, I used switch-case style code to give a descriptive name to every coded value in the data. For reference I used the German.doc file provided at the UCI Machine Learning Repository, which contains a detailed description of every attribute and what each value means. For a better understanding of the rudimentary yet robust code I employed, please visit the project page at German-credit-Risk-Repo.
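
The switch-case style mapping can be pictured as a plain lookup table. The sketch below is written in Python (the language this report uses for modelling) rather than R, and it is not the repository's code. It only uses the code-to-label pairs that can be read off the raw and clean sample rows in this section; the names `CHECKING_LABELS`, `PURPOSE_LABELS`, `COLUMN_LABELS` and `decode_row` are hypothetical.

```python
# Illustrative lookup tables; only codes visible in the sample rows are listed.
CHECKING_LABELS = {
    "A11": "< 0",
    "A12": "0 <= Checking < 200",
    "A14": "No Checking account",
}
PURPOSE_LABELS = {
    "A40": "New Car",
    "A42": "Furniture/Equipment",
    "A43": "Radio or Television",
    "A46": "Education",
}

# Map a 0-based column position to the lookup table for that column.
COLUMN_LABELS = {0: CHECKING_LABELS, 3: PURPOSE_LABELS}


def decode_row(raw_row):
    """Replace coded values with labels; leave numbers and unknown codes as-is."""
    decoded = []
    for index, value in enumerate(raw_row):
        labels = COLUMN_LABELS.get(index, {})
        decoded.append(labels.get(value, value))
    return decoded


# First five fields of the first raw sample row.
print(decode_row(["A11", "6", "A34", "A43", "1169"]))
```

Codes without an entry in a lookup table (here `A34`) pass through unchanged, which makes it easy to spot values that still need a label.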

After cleaning the data for exploratory data analysis, I obtained a more descriptive dataset:

Clean Data
Checking.Account | Duration | Credit.History | Purpose | Credit.Amount | Savings.Account.Bonds | Present.employee
< 0 | 6 | critical account/other credits existing (not at this bank) | Radio or Television | 1169 | Unkown/No Savings Account | Exp >= 7
0 <= Checking < 200 | 48 | existing credits paid back duly till now | Radio or Television | 5951 | Less than 100 | 1 <= Exp < 4
No Checking account | 12 | critical account/other credits existing (not at this bank) | Education | 2096 | Less than 100 | 4 <= Exp <7
< 0 | 42 | existing credits paid back duly till now | Furniture/Equipment | 7882 | Less than 100 | 4 <= Exp <7
< 0 | 24 | delay in paying off in the past | New Car | 4870 | Less than 100 | 1 <= Exp < 4

However, for the later part of this project I also had to make sure the data was in a machine-readable format. This was done using the superml package available in R. The following R code chunk converts the character-based data frame into numeric data for predictive modelling.

# LabelEncoder is provided by the superml package
library(superml)

# Reading the clean data.
data <- read.csv("data/eda-german-credit.csv")

# Defining a new variable which takes col names of qualitative columns
catColumns <- c("Checking.Account", "Credit.History", "Purpose", 
                "Savings.Account.Bonds", "Present.employee", 
                "Other.Debters", "Property", "Other.Installment.plans", 
                "Housing", "Job", "Telephone", "Foreign.Worker", "Outcome", 
                "Sex", "Personal.Status")


# Working on a copy so the descriptive data frame stays unchanged
tf_data <- data.frame(data)

# Transforming each qualitative column into numerical labels
for (column in catColumns){
  label <- LabelEncoder$new()
  tf_data[, column] <- label$fit_transform(tf_data[, column])
}

# Saving the data into a new file for later use
write.csv(tf_data, "data/machine-ready-credit.csv", row.names = FALSE)

After being converted to a machine-readable format, the data looked like this. As expected, it contains no character variables.

Machine Readable Data
Checking.Account Duration Credit.History Purpose Credit.Amount Savings.Account.Bonds Present.employee
0 6 0 0 1169 0 0
1 48 1 0 5951 1 1
2 12 0 1 2096 1 2
0 42 1 2 7882 1 2
0 24 2 3 4870 1 1
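
To make the encoding step concrete, here is a minimal Python sketch of label encoding. It assumes that integer codes are assigned in order of first appearance, which is consistent with the five sample rows above but is not confirmed here as the general behaviour of superml's `LabelEncoder`. The function name `label_encode` is hypothetical.

```python
def label_encode(values):
    """Assign integer codes to labels in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Checking.Account values of the five clean sample rows.
checking = ["< 0", "0 <= Checking < 200", "No Checking account", "< 0", "< 0"]
encoded, codes = label_encode(checking)
print(encoded)
print(codes)
```

Note that the resulting integers carry an order that the original categories may not have, which can matter for models that treat the codes as magnitudes.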

5 Exploratory Data Analysis

Since the data is now more descriptive, I plotted various graphs, most of which are interactive, to derive inferences. There are three major parts in this data analysis module.

  • Gender Analysis
  • Age Analysis
  • Wealth and Job Analysis

Each of these categories aims to provide a better understanding of the distribution of the data across various demographics. I have also included some miscellaneous plots that I thought would help me derive more inferences.

I initially wanted to understand the distribution of the target variable (whether the loan provided was a good decision or a bad one). The bar graph shown below lets us understand it better.

From this graph we understand that there is a class imbalance in our target variable. Although fewer people defaulted on their loan than paid back their credit duly, the share of defaulters is still high, and our aim is to reduce the number of loans granted to customers who default.
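
The imbalance can be quantified with a few lines of Python. This sketch assumes the class counts documented for the UCI dataset, 700 good and 300 bad outcomes, which are not stated in this report; the variable names are hypothetical.

```python
from collections import Counter

# Assumed class counts: 700 good and 300 bad, as documented for the UCI dataset.
outcomes = ["good"] * 700 + ["bad"] * 300

counts = Counter(outcomes)
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = max(shares.values())

print(shares)
print(baseline_accuracy)
```

Under that assumption, a model that always predicts a good outcome is already right 70% of the time, which is why accuracy alone is a weak yardstick for this data.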

5.1 Gender Analysis

To get an idea of how the population of our dataset is distributed, I plotted a histogram that shows the age distribution for each of the two genders and for the dataset as a whole.

It shows that most people who take a loan are in their 20s and 30s. This holds irrespective of gender, as can be seen in the overall distribution.

In the next plot I show the reasons why men and women take a loan. To visualize this I plotted a horizontal grouped bar graph that shows the distribution of men and women across the various purposes.

The graph is plotted by taking the percentage of men and women for each purpose and plotting them side by side. From this it is inferred that, for all categories other than furniture and domestic appliances, a higher percentage of men than women take a loan.

In the next graph I plot the distribution of the credit amount (the amount of the loan) for men and women. The x axis shows the amount in Deutsche Mark (DM) and the y axis shows the count.

It could be hypothesized that a person’s gender does not affect their credit amount and that the majority of the population has a credit amount between 1000 DM and 2500 DM.

Finally, for the gender analysis I looked at whether gender affects a person’s loan outcome. The next plot shows the count of men and women who have good and bad outcomes respectively.

The plot above suggests that men in general have a higher ratio of good to bad outcomes than women. However, the data might not be completely representative of the general population, as there is an imbalance between the number of men and women.
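
The ratio of good to bad outcomes used here, and in the later age and wealth analyses, can be computed per group as in the following Python sketch. The records are made up for illustration and are not taken from the dataset; the function name `good_to_bad_ratio` is hypothetical.

```python
from collections import defaultdict


def good_to_bad_ratio(records):
    """Return {group: good_count / bad_count} for (group, outcome) pairs."""
    counts = defaultdict(lambda: {"good": 0, "bad": 0})
    for group, outcome in records:
        counts[group][outcome] += 1
    ratios = {}
    for group, c in counts.items():
        # A group with no bad outcomes has no finite ratio.
        ratios[group] = c["good"] / c["bad"] if c["bad"] else float("inf")
    return ratios


# Made-up records, not taken from the dataset.
toy = (
    [("male", "good")] * 6
    + [("male", "bad")] * 2
    + [("female", "good")] * 3
    + [("female", "bad")] * 2
)
print(good_to_bad_ratio(toy))
```

Comparing ratios rather than raw counts keeps groups of different sizes comparable, which matters given the imbalance between men and women.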

Summary of Gender Analysis

  • The age distribution of men and women is extremely similar.
  • Gender does not appear to affect the credit amount.
  • Men tend to have a lower percentage of bad risk than women.

5.2 Age Analysis

In the age analysis module I looked at whether the various age groups carry lower or higher risk. I also look at the credit amount distribution, but I do not examine outliers closely in this analysis.

The population has been split equally into 4 major age groups:

  • Young
  • Young Adult
  • Adult
  • Senior
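
One way to split a population into four groups "equally" is to cut at the quartiles of age, so that each group holds about the same number of people. The Python sketch below takes that reading as an assumption; the report does not say whether equal-count or equal-width bins were used, and the ages shown are made up. The names `GROUP_NAMES`, `quartile_cuts` and `age_group` are hypothetical.

```python
import bisect
import statistics
from collections import Counter

GROUP_NAMES = ["Young", "Young Adult", "Adult", "Senior"]


def quartile_cuts(ages):
    """Three cut points that split the ages into four equally sized groups."""
    return statistics.quantiles(ages, n=4)


def age_group(age, cuts):
    """Name of the group whose range contains the given age."""
    return GROUP_NAMES[bisect.bisect_right(cuts, age)]


# Made-up ages for illustration only.
ages = [19, 22, 24, 26, 28, 31, 35, 39, 44, 52, 61, 75]
cuts = quartile_cuts(ages)
groups = [age_group(a, cuts) for a in ages]
print(cuts)
print(Counter(groups))
```

With equal-width bins instead, the groups would cover equal age ranges but could differ a lot in size, since most customers are in their 20s and 30s.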

For the first plot I plotted a stacked histogram of the age distribution for people with good and bad credit.

It can be inferred from this graph that the younger population makes up most of the people with bad risk. The distribution for good credit is also right-skewed, with most people with good credit in their late 20s and early 30s.

The next graph is a box plot of the various age groups against credit amount. This way I can see whether the credit amount differs between age groups.

Young adults and adults have higher credit amounts than the other age groups. The plot also shows that, in general, people with lower credit amounts have bad outcomes.

Another representation of the same data is the violin plot shown below, which leads to similar inferences as the box plot.

Finally, I plot a stacked bar graph of good and bad loans for the different age groups to understand the ratio of good to bad credit.

From this graph it is understood that young adults have the highest ratio of good to bad risk outcomes. Seniors, surprisingly, have the lowest ratio of good to bad risk outcomes.

Summary of Age Analysis

  • Young adults tend to make up most of the bad credit population.
  • People in mid-life (Adults) make up most of the good credit population.
  • Adults have the highest ratio of good to bad credit.
  • Seniors have the lowest ratio of good to bad credit.

5.3 Wealth and Job Analysis

5.3.1 Wealth Analysis

Here I try to understand how people from different wealth classes are distributed in our dataset.

Surprisingly, people with higher savings have a lower ratio of good to bad outcomes, while people with low savings have a higher ratio. However, the people with the highest savings have the highest ratio of good to bad credit.

A similar distribution can be seen when people are grouped by type of credit payment.

5.3.2 Job Analysis

For the data provided, the job attribute is split into different levels of skill and industry. In this analysis I made two plots, one of which shows the different types of job against their credit amount.

It can be postulated from the two graphs above that self-employed or highly qualified professionals have high counts of both good and bad outcomes. The people with high credit amounts are also in the highly qualified or self-employed category. In my opinion, since this level of the attribute includes self-employed people, they might take loans for their businesses, and these businesses might not have been able to pay back the loan. This might also explain why their credit amounts are so high.

Summary of Wealth and Job Analysis

  • People in the middle class seem to have a low ratio of good to bad outcomes with respect to risk, and the people with the highest savings have the highest ratio of good to bad credit.
  • People with low savings make up most of the population asking for loans. This may also be because the dataset is a convenience sample.
  • With respect to job analysis, people who are self-employed take the highest amount of credit in DM, and they are split equally between good and bad outcomes.

5.4 Miscellaneous Plots

The following two graphs mainly look at the distribution of the different housing situations across good and bad outcomes.

We can see that people who live for free have the lowest ratio of good to bad outcomes for loan repayment, and that home owners have the highest.

Finally, to understand why people take up loans, I plotted box plots of the credit amount for each purpose. The graph is shown below.

People who take up loans for other purposes borrow the highest amounts. The next highest amounts are taken by people who borrow to pay for a car. People who take up loans for domestic appliances borrow the lowest amounts.

This concludes our exploratory data analysis section. We will now move on to predictive modelling using various machine learning techniques.

6 Predictive Modelling

In this section I employed various machine learning algorithms to classify whether a person will default on their loan. For predictive modelling I used the Python programming language, because of its better support for ML algorithms.

Some of the algorithms I used are:

  • Logistic Regression
  • Random Forest
  • Decision Trees
  • K Nearest Neighbors
  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • XGBoost

We will look at the precision, recall and F1 score of these algorithms. The code can be found in the interactive Python notebook at the project repository.
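
For readers unfamiliar with these metrics, the following Python sketch computes them for one class from true and predicted labels, using only the standard library. It illustrates the definitions and is not the notebook's code; the labels are made up and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for illustration; 0 and 1 are the two outcome classes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

for cls in (0, 1):
    print(cls, precision_recall_f1(y_true, y_pred, positive=cls))
```

Reporting the metrics per class, as the results table does, shows how well a model handles the minority class, which a single overall accuracy figure would hide.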

6.1 Results

NAME.OF.ML.ALGORITHM.USED        PRECISION.0  PRECISION.1  RECALL.0  RECALL.1  F1.SCORE.0  F1.SCORE.1
DECISION TREES                   0.77         0.39         0.71      0.46      0.74        0.42
LOGISTIC REGRESSION              0.79         0.65         0.91      0.42      0.85        0.51
RANDOM FOREST                    0.78         0.66         0.92      0.38      0.85        0.48
XGBOOST                          0.82         0.69         0.90      0.51      0.86        0.59
QUADRATIC DISCRIMINANT ANALYSIS  0.83         0.57         0.82      0.58      0.82        0.58
SUPPORT VECTOR CLASSIFIERS       0.77         0.66         0.94      0.29      0.84        0.40
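
As a quick sanity check on the table, the F1 score should equal the harmonic mean of precision and recall. The Python sketch below recomputes it for three rows copied from the table; the 0.01 tolerance is an assumption made to allow for the inputs being rounded to two decimals, and the names `f1_from` and `reported` are hypothetical.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for class 0 and class 1, from the table.
reported = {
    "DECISION TREES": [(0.77, 0.71, 0.74), (0.39, 0.46, 0.42)],
    "LOGISTIC REGRESSION": [(0.79, 0.91, 0.85), (0.65, 0.42, 0.51)],
    "XGBOOST": [(0.82, 0.90, 0.86), (0.69, 0.51, 0.59)],
}

checks = {}
for name, rows in reported.items():
    for cls, (p, r, f1) in enumerate(rows):
        # Inputs are rounded to two decimals, so allow a small tolerance.
        checks[(name, cls)] = abs(f1_from(p, r) - f1) <= 0.01
        print(name, cls, round(f1_from(p, r), 3), checks[(name, cls)])
```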

Though the results are not great, I hope to implement a fine-tuned deep learning model that gives better results.

7 Conclusions

The German Credit data was read from the UCI Machine Learning Repository. Initial data pre-processing produced clean and understandable data, which was then used for exploratory data analysis and to derive inferences. Finally, the machine-ready data was scaled and used to predict whether a person would default on their loan. Various machine learning algorithms were used for this purpose, and the XGBoost model performed best among them.

8 Future Work

Right now I am a little busy with my senior year at college, but in the future I wish to build a deep learning model for this data. The trained models could also be deployed in a Shiny web application. Most of all, I wish to implement the machine learning algorithms in R. Since the data is not representative, one could also apply data augmentation to make it more representative.

9 References

  1. UCI Machine Learning Repository, Statlog (German Credit Data) - https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
  2. Caret - LINK
  3. Kaggle.com LINK